Okay. Hmm so we all attended Interspeech, no? So how many sessions you attended? I mean did you attend any any session? Yeah. That's really that's really your So any general impression or feedback? So f for any of you it's a first conference or maybe for? No, I mean in the sense that it's uh Interspeech is really big conference for speech, so did you attend before? Uh, ok ok yeah, o ok yeah, so then you're hiding a you have attended before yes. Yeah, I see see. Yeah, yeah. I t I Uh I think it's quite interesting, but only annoying thing is this multiple sessions. And most sometimes you can't able to go to oral talk this oral presentations most of the time, because because like posters is you can spend a lot of time posters looking at m many posters than sitting twenty minutes for one oral presentation. So in t twenty minutes you can see at least two, three posters, and you can directly talk to people, so. Even I found like uh very few people in some oral presentations. I think most of the people are like their own posters only. Yeah, how presentations are hardly handful of like f five or more. Yeah. Ah, ok. Hmm yeah. Ah, okay. Yeah. Even some panel discussions on that human speech recognition reducing gap between the A_S_R_ and and H_S_R_. It was a bit interesting, lot of arguments and Yeah, b differently it's mainly the differences different approaches of engineers versus linguists or phoneticians and yeah mm. Ah, okay. Yeah, even the panel discussions w I think one is really held in small room, so people were really crowded. Yeah, it's Yeah,. Yeah. Ah ok we went to like far south, to Lagos and yeah, that wa uh ha yeah. That was really good, those beaches are really good. Yeah, even dolphin in the yeah. But the weather was really hot, the south it's more than thirty five. Uh. Lisbon was good, um little bit bit mm. Yeah, even local transport, it's it's. But yeah, most of the time the buses are really crowded. Uh. Yeah. Yeah. Yeah, because it's really big city, no? 
No like. Mm. Yeah. Hmm. Yeah. Yeah. No, re yeah, but that's that's not good, direct bus is not good like. So you can go to the direct uh ce central place and then occas uh. Mm. Yeah, da Uh what is the place, uh the the central I mean where we change the bus to yeah, in the yeah, maybe you can check the booklet. Yeah it's there, like I used to find bus numbers from book only. Oh, okay. Oh, okay. It's like uh in It's like uh all the semi-tied covariance matrices or yeah. So did they reduce it's in between the diagonal covariance and this full covariance. Should try to reduce some parameters by tying. Oh. Uh. Ah, okay. Yeah. Mm. In your tie-in. No, I think they got good results with just using diagonals or definitely yeah. Yeah. Mm. Maybe decorrelation again they do D_C_T_ or K_L_T_ or they do that? L_D_A_ or like Yeah. So No, but still there is there is still ev that's why like people come Yeah, definitely it's not optimum. So it's it's also like doing along along the diagonals all again. It's really computationally really uh yeah so see suppose if thirty nine in te thirty nine to thirty nine for every yeah, yeah, so Uh there is a thing Yeah. Ne no problem, but if you see so many models and so many mixtures yeah. But Yeah, thirty you can see you c you can't really like even you can't the models itself is like thirty eight times more than that, so. So y if you have like five megabytes or ten megabytes of models then yeah. But even I think it's really bit like impossible for the really big systems I guess, like storing storing itself. Yeah yeah, yeah yeah, it's really ti Yeah yeah, for every you're so you're to store all this thirty Yeah. So from Mm-hmm. Mm-hmm. Speak uh yeah, different task and So uh I found one paper interesting. I think you know Vivek most of you know. Uh this is uh Vivek's paper on like variable scale feature extraction. Normally we use a fixed window for feature extraction, like twenty five milliseconds or thirty milliseconds. 
Here he's proposing variable scale, because the fixed scale is non-optimal for it's non-optimum because like for vowels you can have m much longer, and for plosives and these things you need really shorter, like even around less than twenty milliseconds also. So he's ki proposing like this variable scale window f uh kind of online for each segment. Yeah mm mm you can do like you can measure and you can do but he's mm basically doing some mm likelihood ratio testing. So if so the main idea is like suppose you have uh one segment, so he's trying to find the stationarity, quasi-stationarity of that segment. Yeah, as much as possible like but yeah, definitely you have to assume some l uh like minimum and maximum sizes of your own, so he is using minimum of minimum is uh I think twe I don't twelve point five milliseconds I think. Maximum is uh sixty milliseconds. So the minimum No no, I think a minimum is twen twenty milliseconds is i yeah. Twelve point five milliseconds is kind of shift. Twenty millis minimum is no no, de because of uh No, you can use mol yeah. No, mainly because of uh M_F_C_C_ computation, because yeah, it becomes really noisy, like if you s yeah, ten milliseconds means you have only eighty samples for uh and then if you use twenty four filters, the filters won't get any samples, so. Yeah, because of computation. So even I I think even st uh he can find less than twenty millisecond windows also, especially for plosives uh. Shifting yeah, he's using twelve point five millise yeah, shifting is almost same. But the problem like He Yeah. Yeah, he's keeping same number of frames. Uh he's using uh twelve point five yeah, twelve point five millisecond Yeah. Number of uh frames? W you want to Actually the problem's again uh uh you see the shift sets the Nyquist frequency if you compute the modulation spectrum. So again, if you change this shift, the mod the Nyquist the modulation spectrum, Nyquist frequency changes for each window. 
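The arithmetic behind the minimum window size and the frame shift can be checked with a short sketch. It assumes an 8 kHz sampling rate, which is what "eighty samples in ten milliseconds" implies, and an FFT whose length equals the window length (no zero padding). The function names are illustrative and do not come from the paper.

```python
def samples_per_window(window_ms, sample_rate_hz=8000):
    """Number of samples in an analysis window (8 kHz is an assumption)."""
    return int(round(window_ms * sample_rate_hz / 1000))


def fft_bins(n_samples):
    """Non-negative frequency bins of an FFT as long as the window itself."""
    return n_samples // 2 + 1


def modulation_nyquist_hz(shift_ms):
    """Frames arrive at 1000 / shift_ms per second; the modulation-spectrum
    Nyquist frequency is half of that frame rate."""
    return (1000.0 / shift_ms) / 2.0


if __name__ == "__main__":
    n = samples_per_window(10)
    # Few bins have to be shared by 24 overlapping mel filters, so the
    # narrow low-frequency filters can end up with no bin at all.
    print(n, "samples,", fft_bins(n), "FFT bins for 24 filters")
    for shift in (10, 12.5, 20):
        print(shift, "ms shift ->", modulation_nyquist_hz(shift), "Hz")
```

A 10 ms shift gives a 50 Hz modulation Nyquist frequency and a 12.5 ms shift gives 40 Hz, which is why a shift that varies from window to window complicates any later modulation-spectrum processing.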
So if you want to do again another high-level feature extraction again, it will be problem, so. So but he's uh like he's he discuss I mean describing this, because even this itself is a problem, keeping the f uh fixed frame size, because you're analysing your this uh segment many times. So this shift will suppose if you find one segment of sixty milliseconds, and then you're doing this ten millisecond so every almost I think yeah, mm uh yeah, five frames y you're analysing the previ this already segment so. So again, this may blur some frequency transitions or yeah, but the problem is again this uh bottleneck is here, like you can't change your Nyquist frequency modulation and Yeah. Yeah. Yeah. So it's so uh so the main idea is like first to take the some some window, some sam some samples, then he assume somewhere at some variable point there is a change. Then then it's not symmetric, i you can cho uh y it's basically uh too like again, do for all the samples in that, from zero. So then what he does is like he proposed uh like some likelihood ratio test based on maximum likelihood. So th tha that is really simple, like so what he does is like so he he computes the residual, he first he does the L_P_ analysis and then he computes the residual, and he takes the residual energy of f full signal, like from sample zero. The full signal. So maybe I can The wh the whole window. No, that's that like he assumes yeah, sixty milliseconds or something he start with, so so then he can yeah, yeah. Yeah, then he can split that window at point N_ say point N_, so then you will have zero to N_, you can segment it maybe. Yeah, so yes, suppose this is your window. So then you can move your like this. You can move your point, so that then this will be like one and this will be second. But again, he assumes some initial uh sizes for these windows. Yeah. He don't start with the yeah, he don't s yeah, start with zero samples or something. 
He's starting with uh so the l left window is starts at twenty milliseconds, so so you already hear like s suppose this is full signal, you're already here. So right window is should end at twelve point two, so you're to search basically in this range. So Yeah. Yeah. Yeah, and then if you the maximum is sixty millisecond frame for him. So if you don't find anything Sixty milliseconds. If you d No, the point is depending on the likelihood ratio. So he got so suppose this so this is suppose se segment one, and this is segment two. So he compute this error here. So and then he p gives some kind of in terms of this residual error. So this residual error is for the full window, from sample zero. Then there would be suppose this point is say N_. So zero to N_ and then like So this the b basically the main equations are here Yeah, he's for finding the likelihood for full frame, and then it's basically likelihood ratio test, so he's comparing the likelihood of the full segment divided by the likelihood of the the sub-segments. Likelihood of the uh like this likelihood is uh this error estimate of the residual. So residual power he can compare t so he after do after doing L_P_ he can get a l residual and then he can compare the power of the residual. So that he's proving that power is again maximum likelihood estimate of your L_P_ parameters. So that which interesting it's you don't really need to do a lot of computation, you just need to take the error of n uh residual of this full window and then residual of energy of the sub-windows and then you can just divide them and then the only thing is like again he has to use some threshold to decide, that that's only the p problem. He got uh he's finding something more uh around three or three point five as the optimum threshold, but the and another advantage is like it's not really changing speech recognition, whatever, so not really changing because of the threshold, so. 
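The split search described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses a least-squares linear predictor and the standard Gaussian log-likelihood ratio built from residual variances, and the names `lp_residual_variance`, `log_likelihood_ratio` and `best_split`, the default LP order of 10, and the parameters `min_left` and `min_right` are all assumptions made for the example.

```python
import numpy as np


def lp_residual_variance(x, order=10):
    """Mean squared residual of a least-squares linear predictor fitted to x."""
    x = np.asarray(x, dtype=float)
    # Column k holds x[t-1-k], so each row predicts x[t] from `order` past samples.
    past = np.column_stack(
        [x[order - k - 1 : len(x) - k - 1] for k in range(order)]
    )
    target = x[order:]
    coeffs, *_ = np.linalg.lstsq(past, target, rcond=None)
    residual = target - past @ coeffs
    # Small floor keeps the logarithm finite for perfectly predictable input.
    return float(np.mean(residual ** 2)) + 1e-12


def log_likelihood_ratio(x, n, order=10):
    """Log ratio of 'two LP models, split at n' against 'one LP model'.

    Large values mean the full window is poorly described by a single
    stationary model compared with two sub-windows.
    """
    total = len(x)
    s_full = lp_residual_variance(x, order)
    s_left = lp_residual_variance(x[:n], order)
    s_right = lp_residual_variance(x[n:], order)
    return 0.5 * (
        total * np.log(s_full)
        - n * np.log(s_left)
        - (total - n) * np.log(s_right)
    )


def best_split(x, min_left, min_right, order=10):
    """Try every split point that leaves at least min_left samples on the
    left and min_right on the right; return (split index, score)."""
    candidates = range(min_left, len(x) - min_right + 1)
    scores = [log_likelihood_ratio(x, n, order) for n in candidates]
    best = int(np.argmax(scores))
    return candidates[best], float(scores[best])
```

A caller would compare the returned score with a threshold and keep the full window when the score stays below it. The value of about three to three point five mentioned in the discussion belongs to the paper's own statistic, so it may not carry over to this sketch unchanged.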
Uh you don't really need to f like fiddle with threshold a lot, so mm. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah, this is yeah, I understand. This is this is the actually basically some theoretical proof, like maybe if you want threshold you can see and he's quoting from uh Leven like uh. No, this is from again yeah, statistical signal processing. What they say here is like suppose if you analyse two distinct stationary segments with this L_P_ analysis in the same analysis window, the coding error will always be greater than the ones resulting from this analysis in two windows. Two stationary windows. So the he's basically based on this theorem, so so what he's saying is like if you do this error, it will be always greater than the mm th these errors of two stationary windows. Yeah. Yeah. Yeah yeah yeah yeah. Yeah, this maybe I think yeah, this he's getting some improvement, but Yeah yeah yeah yeah. So he's just uh uh he's normalising the energy coefficient, because energy is really affected a lot. Uh so he's li Energy like, because C_ zero component like, so he's using M_F_C_C_ so he's normalising the C_ zero component for this t yeah. Power is again like yeah yeah yeah yeah, it's time like yeah. It's okay like samples, or you can see root mean square energy or something. Yeah. Yeah. No. No, n no, you vo what you mean like the window length. The sixty mil no, it i it doesn't really Uh but I think it's he he doesn't use I guess, because uh uh then like you just you take these features and you train model, so. It really matters wh what frame shift you're using for models, like because how many he needs to use same freq No no, wait. No, he will get like every ten milliseconds he get one fra one features, like he it doesn't really depend, because suppose if you take your t case, like you your window i maybe longer, so doesn't really matter like how what size window you use or not. But I tell uh at every ten milliseconds whether you give some features or not really matters, no? Like so. 
So it doe you can use uh thirty milliseconds or fifty millisecond window. Same. Yeah yeah yeah. Yeah yeah yeah yeah yeah. Yeah yeah, framing it's like it's really yeah. Yeah, if you have if you're change framing, it's really I think even it's really problem, no? Like if you try even models also I guess, if you change. It's not only with this model, it's in spectra and shifts in deltas, because Yeah, but it's really complicated I think. If you use yeah, that becomes really Yeah. But this is kind of interesting. Yeah yeah yeah, the temporal dec. Mm-hmm. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah, first this is proposed by like uh even we read one paper in our reading group, you remember that? I have implemented, it's quite working well, like uh yeah. Mm. Yeah. Yeah, Honza is also worked for his be yeah. Uh i yeah. Ne he was re he was quoting But that Yeah, he was uh quoting that paper also, like v he was telling this uh this is kind of optimisation criteria what this temporal decomposition is doing like. You're trying to segment your s signal into like discrete windows and then so but he was telling like this uh relationship between optimisation and and then quasi-stationarity is not really obvious. So stationarity is again different, no? Stationarity is this is optimisation, we want to like see some signal, few segments, which can really represent whole, so maybe that's why like this may not be really Yeah. Yeah, this one yeah, he's getting some improvement n uh this is a, yeah. No no no, they again But again uh, getting that ten papers is really difficult task. 
It takes maybe years and again uh it definitely like who will who will choose the ten papers like, and if we asked yeah, but yeah, those things are practically kind of really but I think this kind of things what you showed I mean what you said is really good, like if people start implementing like some people propose something in feature level, some propose in something in model level, but these two are really independent, no? When really work combine these two or Yeah, especially features, like suppose if somebody come with different good features, again people st again use M_F_C_C_s No no, it it's not really sure also, make suppose uh people you can't really force people to use same features, so so they wo they will be happy with their own features and their own scripts, so their bit rate like tend to change features every time and it's Hmm. But even NIST you can Yeah. Yeah, yeah. Then So at lea especially L_V_C_S_R_ in really big systems, then people Yeah. Yeah. Because yeah, it's it's again you don't know. Oh. No, then uh there will be only one conference in three years or something for speech, because But again like That's so you can publish only once in three years or Yeah. But even people are doing suppose in Interspeech there was uh some challenge for speech synthesis, so so what's is uh like it challenge was really good, like sup they give the database, so you have to work on that database. What technique t use is it's your choice like, but that database and then the results analysis is they decide, so. Even speech recognition also some tests are coming, like for phone segmentation or something. People give some database and then you have to dis you can use whatever you want and you have to produce even for features also like features also I think Yeah yeah, the they they designed the data such a way that it's it's really like really real uh data. 
No no, for s nay, speech synthesis it's like they give you some data, so whether like you use for training or l you can you can do like data driven or like model based approach or whatever like. At the end then they will ask you to synthesise some sentence and you have to synthesise them and then you have to send them Your training data, onl no development, training, no yeah, training Evaluate it's again subject to s uh because for speech synthes yeah. Yeah, listen and yeah d yeah, even uh they it's mostly they use native speakers only, because they can really judge well, so. Yeah yeah yeah yeah. Yeah, yeah yeah. So even they propose some kind of word error rate. So you have the original speech, so i the synthesised speech, is there any words which are not matching? So even speech synthesis, they're also uh introduced this W_E_R_ term, so. Uh. But this was really quite successful like in this conference, Interspeech. So a lot of people participated in uh I think even they're continuing this for uh next year and Text to speech. But uh I heard like even for ICASSP next year there is some com uh like competition by Martin Krug, Sheffield, and on this feature extraction stuff and so at least if you make task simple focus, then it may be good to compare these features and but if you s use some uh large vocabulary system, it takes t six months to build and at the end you don't know whether your features are really uh so. So maybe that's for like people are always using P_L_P_s or M_F_C_C_s, it's it's such a b like lot of time involved, so you can't really check many thi Yeah, yeah yeah yeah. But at least in IDIAP we have this numbers recogniser is almost kind of free, so everybody is using so that's what we are doing, no? 
Almost almost like we are putting different features and I think yeah, yeah, numbers recogniser is not really different from like uh a f at least in IDIAP we have almost same, but maybe you are using twenty nine, that's for you it's different, but one more But at least i twenty seven. Hemant and we're all using twenty seven, so. Yeah. Yeah. No, but the problem is you're using only digits, so maybe that's how you No, but test set is digits, your m main task is digits k yeah, that's uh like but O_G_I_ numbers are there like now we're little bit confused, like we're using twenty seven phones and before it was like twenty four, twenty five. But it's better, like once you have like whatever the back end, then you can put features like whether they're gammas or like spectral entropies or M_F_C_C_s or whatever. Then at least you can see Uh? Yeah yeah yeah. But Aurora is Aurora task. So but Aurora, is it really big database, or I mean how much time it takes to set up system and then It's fast, no? But the problem is Aurora again, these models are word based models, no? Yeah, so again here we use triphones and so but definitely, I d if we want to show noise setting, it's you have to show on Aurora also, like Aurora is real time. Yeah. Yeah. Yeah, yeah yeah. But yeah Yeah, at least I mean wha those studies I think people already did for even in I_C_ I_C_S_L_P_ two thousand two there is special session on features for Aurora, so yeah, yeah, i yeah yeah. This ICSI features and this uh s uh s uh O_G_I_ features and all this. So there are already people but again like uh, at the end like for L_V_C_S_R_ people are using P_L_P_s or M_F_C_C_s or nothing of these fancy feature Yeah. New But maybe my have a f for publishing another paper or something. Mm. Yeah. But maybe Guillaume, you can tell, no? Like uh you worked i with these Siemens people or no, Siemens or these Daimler Chrysler. What kind of features do you use, like do you use um any you don't yeah, you don't Ah ok okay. 
Okay. Ah ok Yeah, even s yeah. I didn't discuss anything with Sunil also, like I dunno, technical discussions, anything. these kind of things are a bit, yeah. Yeah, yeah. But do you have any like feeling that whether they go for all these new features or they usually use only mm. Maybe yeah. Mm-hmm. Hmm. But even they don't believe like with one paper or two I guess, because at least for them no P_L_P_s or M_F_C_C_s, they know that okay, these things work on every task, so okay, we can pu put hands on these things yeah. Huh? Yeah, yeah yeah yeah. Maybe like yeah, f because Sunil is working in Nokia, he's a expert in kind of features or yeah, it l I think. But yeah, we're not sure, even Sunil won't tell us. No, no n no, this uh Cohen, you mean like uh uh Hmm. Oh ok Huh. Uh. But this was signal company, no like? This voice signal company, Hy Hynek always mentions uh Cohen's company with what they're using, you n yeah. Voice no no, voice signal. They make uh recognition for these Samsung phones and yeah. Uh they have like really s many recognisers I guess. Maybe definitely they may be using P_L_P_s or you know. Yeah RASTA or RASTA or something you mean. Mm. Yeah, because mm. Yeah, but again Yeah, too new is really definitely because like want to at least see i at least four, five years of results of some new features and some consistent results or Ah. Uh space or ah, okay. Yeah yeah yeah. Mm. Yeah. But yeah, d Yeah, uh another issue is like again the computational issues, no? If they want to make on mobile, they don't want really use some fancy, really expensive computationally expensive features or all these Yeah yeah, it It's just phone recognition or something like Yeah. But most of the time mobiles, they use D D_T_W_ kind of thing, no like? All these uh so voice calling yeah, voice calling and these thing, because that's easy like, because they don't really need to put lot of uh memory and these things, otherwise if they want to build a real Yeah yeah, yeah. 
Where where? Mm. Yeah yeah. Yeah, Cohen, yeah, yeah. It's Samsung, some some new really new phone like, mm. Yeah. But uh the software is always available or like No, the software is again on top of mobile, or like it comes with your mobile No, the thing is usually these kind of these kind of facilities are like no not normal, so so they may charge more or like yeah, yeah. Yeah. Ah, ok yeah, definitely they may charge for more money f again, some extra bill for this using this recognition engine or something. Yeah. Yeah. Yeah. Yeah, but maybe like if it can't really recognise you it will be really annoying it's better. Yeah, I have used once like this dragon, that was kind of good like, yeah. Because even in when I was working in uh like in Edinburgh like, one guy, he use i he got some problem with hands and then he was always using this n uh dictation even he used to write lot of C_ plus plus code with the dictation like. It's really good like, so. But you ought to really train and then you're to kind of uh yeah. I get used to the system. Yeah yeah, it's really like you ought to get used to the system, like you how to say all the callings and this thing system how t but it's really funny like, he is able to use it for like like years, I mean he used to write a lot of code and, you know, it's really great. Yeah yeah, you can listen Festival. Read the like Mm yeah. Yeah yeah, Fes yeah, I think I think you can do the Festival, you can just call like uh this there are different kind of again, how you call the Festival, so you can just give the full there is an too. So then you will get the all the caller and other things. If if you want to like read full text, then you can say like the full text, like those syntax and then it will speak till it finishes the text. Yeah, it's real it's yeah. Diphone, yeah, yeah. That's why it's real time, like otherwise it's bit difficult and it's okay, you can easily understand, so it's intelligible, so. 
Uh Festival is good, like now it comes at all the Linux boxes also, so in so already in this No the there are so many voices, like again Which voice you want to use. No, most of the the Festival, uh what you get on Linux machines, is diphone based only. So they supply few voices. Yeah yeah, yeah, just uh concatenation of the no, it's already there, if you just put li Festival, it comes wi because it's comes with direct Linux software and you know, it is open source. Festival is open It's Yeah, cat voice, like there are like they they have few Yeah, new voices like, yeah. Yeah, that's the difference that tha in views this unit selection. It's not diphones, you have really large inventory of the speech. Uh those voices, they're not yet released. Yeah. No no, it's free, they'll be releasing soon I guess. So maybe you can just check no, i it's it's here on the web, then it won't come with your Linux software, so you have to again download from uh their web page. Yeah, you can download, now. Yeah, yeah yeah. It the voices are called Multisyn, Multisyn uh sl uh that's some voice. So you can download from the voice from the web Festival. Yeah yeah yeah, yeah. Yeah yeah, yeah. So voice is like the database of, the parallel speaker who speaks that this text, so. It was really large database, so it the voi the quality will be better because you can find similar yeah, it's more or less but not exactly like uh it's two times or one point five times or something. But you can prune like how much you want, because so this is again like uh that's kind of my P_H_D_ work how in this system like how you choose the units and how you concatenate. So there are again cost functions and Viterbi. So you can you can always optimise, you can put some thresholds and pruning and then you can make it real time. So then again, compromise uh between the quality and yeah. But still it's good like now. Th you can just check Multisyn and then voices. Uh yeah, yeah, it's really good. 
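The unit choice with cost functions, Viterbi search and pruning mentioned above can be illustrated with a small sketch. It is a generic dynamic-programming search and does not reproduce Festival's or Multisyn's actual code; `select_units`, the shape of `target_costs`, and the `beam` parameter are hypothetical names chosen for the example.

```python
def select_units(target_costs, join_cost, beam=None):
    """Pick one candidate unit per target position with minimum total cost.

    target_costs: list (one entry per position) of {unit_id: target cost}.
    join_cost:    function (previous unit, current unit) -> concatenation cost.
    beam:         if set, keep only that many cheapest partial paths per
                  position; faster, but the result may no longer be optimal.
    """
    # best[unit] = (cost of the cheapest path ending in unit, that path)
    best = {unit: (cost, [unit]) for unit, cost in target_costs[0].items()}
    for costs in target_costs[1:]:
        if beam is not None:
            cheapest = sorted(best.items(), key=lambda item: item[1][0])
            best = dict(cheapest[:beam])
        new_best = {}
        for unit, target in costs.items():
            # Cheapest predecessor once the join to this unit is included.
            prev, (prev_cost, path) = min(
                best.items(),
                key=lambda item: item[1][0] + join_cost(item[0], unit),
            )
            new_best[unit] = (
                prev_cost + join_cost(prev, unit) + target,
                path + [unit],
            )
        best = new_best
    total, path = min(best.values(), key=lambda value: value[0])
    return path, total


if __name__ == "__main__":
    # Units that were adjacent in the recorded database join for free.
    adjacent = {("a1", "b1"), ("b1", "c1")}
    join = lambda prev, cur: 0.0 if (prev, cur) in adjacent else 1.0
    targets = [
        {"a1": 0.5, "a2": 0.0},
        {"b1": 0.0, "b2": 0.4},
        {"c1": 0.1, "c2": 0.0},
    ]
    print(select_units(targets, join))          # full search
    print(select_units(targets, join, beam=1))  # aggressive pruning
```

With the full search the contiguous sequence a1, b1, c1 wins because its joins are free. With `beam=1` the search commits early to a2 and returns a more expensive path, which is the compromise between quality and speed described in the discussion.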
Yeah yeah yeah yeah. But only sometimes like it can't really call this yeah. Uh standard text. Yeah. Yeah, ASCII file is ASCII and but only the problem's again acronyms and sometimes it expands, sometimes it may not expand, because it's not in their dictionary, so yeah. It's not in the dictionary, like if we again it's problem like, so. But those things you can add, like if you're really familiar with Festival, so you can add always these Yeah. But at least these things okay like, it won't really make mistakes so often like, sometimes Alan Black and Paul Taylor, they started r Festival in ninety six or something. Edinburgh C_S_T_R_, yeah. Uh f oh uh this ma this Johann Waters or like no, they d what they did Portland O_G_I_, they did L_P_C_ based synthesis. The Festival is alrea already like it's one first you start with th Ah, okay. Mm. From. I don't know. No no, actually the b Ah, okay, okay. Yeah, maybe, because Ah, ok Hmm. Yeah, first they started in C_S_T_R_ with Alan Black and Paul Taylor, then then it expand to many places, because if somebody is fil no, Simon King is he's one of the others, but Rob Clark is the man that you met um. Uh uh, already. Yeah. It's So anything uh more about Lisbon or Interspeech? I it's enough, yeah. Okay.